STATS 32 Session 4: Data Transformation

Kenneth Tay

Oct 11, 2018

Recap of session 3: data visualization

ggplot2 syntax

library(ggplot2)
ggplot()

ggplot2 syntax

ggplot() +
    geom_violin(data = mtcars, 
                mapping = aes(x = factor(cyl), y = hp))

ggplot2 syntax

ggplot() +
    geom_violin(data = mtcars, 
                mapping = aes(x = factor(cyl), y = hp)) +
    geom_point(data = mtcars, 
               mapping = aes(x = factor(cyl), y = hp),
               position = "jitter")

ggplot2 syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_point(position = "jitter")

ggplot2 syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_point(position = "jitter") +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder", 
         y = "Horsepower")

ggplot2 syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_point(position = "jitter") +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder", 
         y = "Horsepower") +
    theme_classic()

Agenda for today

Why do we need to transform data?

We rarely get data in exactly the form we need!

Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).

The 5 basic dplyr verbs

Toy example: Student scores

scores
##     Name English Math Science History Spanish
## 1 Andrew      60   96      80      56      77
## 2   John      66   55      56      64      77
## 3   Mary      92   63      70      62      98
## 4   Jane      80   76      89      55      40
## 5    Bob      80   80      82      48      50
## 6    Dan      58   52      79      90      61

select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

  1. Take the scores dataset.
  2. Select the Name and History columns.

mutate: create new columns based on old ones

Form teacher: “What are their total scores?”

  1. Take the scores dataset.
  2. Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish

arrange: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

  1. Take the scores dataset.
  2. Arrange rows by the Total column.

arrange: reorder rows

Form teacher: “No no, better students on top please…”

  1. Take the scores dataset.
  2. Arrange rows by the Total column, but in descending order.

arrange: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

  1. Take the scores dataset.
  2. Arrange rows first by the Total column in descending order, then by the Name column (in ascending order).

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

  1. Take the scores dataset.
  2. Filter for rows where the value in the History column is less than 60.

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the correlation of the Math and Science columns.

Science teacher: “I want to know the mean and standard deviation of the scores for science”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.

group_by: use dplyr verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

  1. Take the scores dataset.
  2. Group the dataset rows by gender (assuming there is a gender column).
  3. Summarize each group of the dataset by taking the mean of the Spanish column.

Practice

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”

Practice

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have English < 70 and Spanish < 70.
  3. Select the Name column.

Practice

Math teacher: “I want to know which students scored < 70 for math, and I just want their names and their mean score across subjects”

Practice

Math teacher: “I want to know which students scored < 70 for math, and I just want their names and their mean score across subjects”

  1. Take the scores dataset.
  2. Filter for the rows which have Math < 70.
  3. Mutate existing columns to get a new one: Mean = (English + … + Spanish)/5.
  4. Select the Name and Mean columns.

select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

  1. Take the scores dataset.
  2. Select the Name and History columns.
scores %>%
    select(Name, History)
##     Name History
## 1 Andrew      56
## 2   John      64
## 3   Mary      62
## 4   Jane      55
## 5    Bob      48
## 6    Dan      90

mutate: create new columns based on old ones

Form teacher: “What are their total scores?”

  1. Take the scores dataset.
  2. Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish
scores <- scores %>%
    mutate(Total = English + Math + Science + History + Spanish)
scores
##     Name English Math Science History Spanish Total
## 1 Andrew      60   96      80      56      77   369
## 2   John      66   55      56      64      77   318
## 3   Mary      92   63      70      62      98   385
## 4   Jane      80   76      89      55      40   340
## 5    Bob      80   80      82      48      50   340
## 6    Dan      58   52      79      90      61   340

arrange: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

  1. Take the scores dataset.
  2. Arrange rows by the Total column.
scores %>%
    arrange(Total)
##     Name English Math Science History Spanish Total
## 1   John      66   55      56      64      77   318
## 2   Jane      80   76      89      55      40   340
## 3    Bob      80   80      82      48      50   340
## 4    Dan      58   52      79      90      61   340
## 5 Andrew      60   96      80      56      77   369
## 6   Mary      92   63      70      62      98   385

arrange: reorder rows

Form teacher: “No no, better students on top please…”

  1. Take the scores dataset.
  2. Arrange rows by the Total column, but in descending order.
scores %>%
    arrange(desc(Total))
##     Name English Math Science History Spanish Total
## 1   Mary      92   63      70      62      98   385
## 2 Andrew      60   96      80      56      77   369
## 3   Jane      80   76      89      55      40   340
## 4    Bob      80   80      82      48      50   340
## 5    Dan      58   52      79      90      61   340
## 6   John      66   55      56      64      77   318

arrange: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

  1. Take the scores dataset.
  2. Arrange rows first by the Total column in descending order, then by the Name column (in ascending order).
scores %>%
    arrange(desc(Total), Name)
##     Name English Math Science History Spanish Total
## 1   Mary      92   63      70      62      98   385
## 2 Andrew      60   96      80      56      77   369
## 3    Bob      80   80      82      48      50   340
## 4    Dan      58   52      79      90      61   340
## 5   Jane      80   76      89      55      40   340
## 6   John      66   55      56      64      77   318

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

  1. Take the scores dataset.
  2. Filter for rows where the value in the History column is less than 60.
scores %>%
    filter(History < 60)
##     Name English Math Science History Spanish Total
## 1 Andrew      60   96      80      56      77   369
## 2   Jane      80   76      89      55      40   340
## 3    Bob      80   80      82      48      50   340

More on filters

Other ways to make comparisons:

Combining comparisons:

More filter examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")
##   Name English Math Science History Spanish Total
## 1  Dan      58   52      79      90      61   340

Language teacher: “I want to know which students score < 50 for either English or Spanish”

scores %>% 
    filter(English < 50 | Spanish < 50)
##   Name English Math Science History Spanish Total
## 1 Jane      80   76      89      55      40   340

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the correlation of the Math and Science columns.
scores %>%
    summarize(corr = cor(Math, Science))
##        corr
## 1 0.5470561

summarize: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.
scores %>%
    summarize(Science_mean = mean(Science), 
              Science_sd = sd(Science))
##   Science_mean Science_sd
## 1           76   11.54123

group_by: use dplyr verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

  1. Take the scores dataset.
  2. Group the dataset rows by gender (assuming there is a gender column).
  3. Summarize each group of the dataset by taking the mean of the Spanish column.
scores %>%
    group_by(Gender) %>%
    summarize(Spanish_mean = mean(Spanish))
## # A tibble: 2 x 2
##   Gender Spanish_mean
##   <chr>         <dbl>
## 1 F              69  
## 2 M              66.2

Chaining multiple dplyr commands

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have English < 70 and Spanish < 70.
  3. Select the Name column.
scores %>%
    filter(English < 70 & Spanish < 70) %>%
    select(Name)
##   Name
## 1  Dan

Chaining multiple dplyr commands

Math teacher: “I want to know which students scored < 70 for math, and I just want their names and their mean score across subjects”

  1. Take the scores dataset.
  2. Filter for the rows which have Math < 70.
  3. Mutate existing columns to get a new one: Mean = (English + … + Spanish)/5.
  4. Select the Name and Mean columns.
scores %>%
    filter(Math < 70) %>%
    mutate(Mean = (English + Math + Science + History + Spanish)/5) %>%
    select(Name, Mean)
##   Name Mean
## 1 John 63.6
## 2 Mary 77.0
## 3  Dan 68.0

Today’s dataset: Flights from New York City

Exploring flights from NYC to the Bay Area
(Source: Sunday Express)









Optional material

transmute: create new columns based on old ones, discard old ones

Form teacher: “I just want the mean score for each student”

scores %>% 
    transmute(mean = (English + Math + Science + History + Spanish) / 5)

TRUE/FALSE statements

How does R understand the code filter(History < 60)?

TRUE/FALSE examples

3 > 2
## [1] TRUE
3 < 2
## [1] FALSE
3 == 2
## [1] FALSE
c(1, 2, 3, 1) == c(3, 2, 1, 2)
## [1] FALSE  TRUE FALSE FALSE
c(1, 2, 3, 1) == 1
## [1]  TRUE FALSE FALSE  TRUE

Be careful with NAs!

1 == NA
## [1] NA
NA == NA
## [1] NA
is.na(NA)
## [1] TRUE